Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MMseqs2 clustering and taxonomy #6574

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

hugolefeuvre
Copy link
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

Comment on lines 19 to 21
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
mmseqs createtaxdb
'$createtaxdb.database_type.mmseqs2_db_select.fields.path'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try:

Suggested change
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
mmseqs createtaxdb
'$createtaxdb.database_type.mmseqs2_db_select.fields.path'
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb &&
mmseqs createtaxdb
'taxdb'

If needed add an extension to taxdb.

It seems that in the end this script
is executed and I guess that the trick should work.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.

Of course the option to allow users to provide their own mapping needs to be preserved.

Wondering if createdb and createtaxdb should be separate tools / data managers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.

I don't know enough about this to be able to answer your question.

Wondering if createdb and createtaxdb should be separate tools / data managers.

We've already thought about it, but we thought it would be more interesting to be able to perform an end-to-end taxonomy analysis (fasta file to contig taxonomy) with a single galaxy module. But it can be discussed if you think it might be useful for other uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the main question is if the output of createdb and createtaxdb can be used by multiple tools / reused later.

There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has disadvantages.

My intuition tells me to split it. Also because in jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.

But this is only my intuition and I leave this to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, one of the major problems with separating createdb and createtaxdb is the output from these 2 modules. There are several reasons for this:

  • The output format is not supported by Galaxy
  • createdb creates several files which must be used together (in a collection at the very most)
  • createtaxdb creates a file which must be integrated with the other files in order to be able to use the DB as a taxDB, it can take as input either the files generated by createdb, or the files from a DB which doesn't have a taxo.

In other words, createtaxdb might not be in the actual xml because the databases I'm giving it as input at the moment are only the taxonomy DBs proposed by MMseqs, but there are other DM mmseqs2 data_tables that don't have a taxonomy and for which createtaxdb could be useful (https://github.com/soedinglab/mmseqs2/wiki#downloading-databases).

This is not possible if compute and download tasks are mixed in a tool

Could this be why I'm having a problem in my test?
Finally, I don't really know what to do. On the one hand, I think it's a good idea to split it up, as this allows more flexibility for the user, but on the other hand, it seems more complicated to set up and less intuitive for the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be why I'm having a problem in my test?

Unlikely. The problem seems to be just that the tool tries to write next to the input files (the databases) .. which is not possible. If the symlink trick does not work the only option seems to be to copy the database to the working directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try later by putting the database in a directory and making a symlink of that directory. Because the way the symlink was created only took into account the Swiss-Prot file in the database, but there are others (Swiss-Prot_h, Swiss-Prot_index...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directory datatype might be a solution for the datatype questions.

Finally, I don't really know what to do.

Follow your intuition. Your arguments are perfectly reasonable. And we can still improve this later.

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first quick review of the tools. Thanks already for the massive amount of work.

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
tools/mmseqs2/mmseqs2_taxonomy_assignment.xml Outdated Show resolved Hide resolved
@hugolefeuvre
Copy link
Contributor Author

It seems that the tests are not currently running because of problems with MMseqs2, in particular when downloading the taxonomy part for Uniprot (as the SILVA test is running correctly). I'll wait until the problem is solved, if it takes too long then I'll use other test databases

This reverts commit 09e0d94.

Back to macros parameters since tool issue has been resolved
@hugolefeuvre
Copy link
Contributor Author

Hi @bernt-matthias, does the PR require further, more in-depth reviews (I guess so), perhaps by other reviewers ?

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had another look, mainly on the data manager.

Is there any relation between the nt and the tax-nt databases? For instance are there corresponding entries?

@hugolefeuvre
Copy link
Contributor Author

hugolefeuvre commented Dec 2, 2024

Is there any relation between the nt and the tax-nt databases? For instance are there corresponding entries?

What do you mean with relation ? If the databases in mmseqs2_nucleotide_databases and mmseqs2_nucleotide_taxonomy_databases are the same ? For example SILVA which is in both datatable ?
If yes, there is no difference between them. Databases that are in nt and not in tax-nt databases don't have taxonomic information and need createtaxdb if you want to use them as taxonomic database.

@bernt-matthias
Copy link
Contributor

One suggestion: If we can agree on a set of data tables and their columns we can also split the PR in two parts: one for the tool and one for the data manager. To me it seems that the datamanager will need a while and will slow down the progreee of the tool.

@hugolefeuvre
Copy link
Contributor Author

One suggestion: If we can agree on a set of data tables and their columns we can also split the PR in two parts: one for the tool and one for the data manager. To me it seems that the datamanager will need a while and will slow down the progree of the tool.

No problem if you think it's the best way to do it. I'm not an expert and I don't know much concerning all data tables already present in galaxy and what we can use and how to do it, but I'm curious to learn.
Questions :

  • Do I still need to create an issue on MMseqs repo to ask them about db versions ?
  • How can we agree on a set of data tables ?
  • The tool part needs a datamanager to function, so won't we have to wait until the DM is finished for the tool part to work? (just wonder why split the PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants